Pandas Tutorial

Software Carpentry, EITN, Paris, November 20th, 2015

Bartosz Teleńczuk

forked from the tutorial at EuroScipy 2015 by Joris Van den Bossche (Ghent University, Belgium)

Licensed under Creative Commons CC BY 4.0

Content of this talk

Why do you need pandas?
Basic introduction to the data structures
Guided tour of some pandas features with two case studies: a movie database and air quality

If you want to follow along, this is a notebook that you can view or run yourself:

All materials (notebook, data): https://github.com/btel/2015_eitn_swc_pandas
You need pandas >= 0.15.2 (the easiest way to get it is Anaconda)

Some imports:



In [7]:

    
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.options.display.max_rows = 8

Let's start with a showcase

Case study: air quality in Europe

AirBase (the European Air quality dataBase): hourly measurements from all air quality monitoring stations in Europe

Starting from these hourly data for different stations:



In [2]:

    
data = pd.read_csv('data/airbase_data.csv', index_col=0, parse_dates=True, na_values='-9999')



In [3]:

    
data









    Out[3]:

                         BETR801  BETN029  FR04037  FR04012
    1998-01-01 00:00:00      NaN     16.0      NaN      NaN
    1998-01-01 01:00:00      NaN     13.0      NaN      NaN
    1998-01-01 02:00:00      NaN     12.0      NaN      NaN
    1998-01-01 03:00:00      NaN     12.0      NaN      NaN
    ...                      ...      ...      ...      ...
    2012-12-31 20:00:00     16.5      2.0       16       47
    2012-12-31 21:00:00     14.5      2.5       13       43
    2012-12-31 22:00:00     16.5      3.5       14       42
    2012-12-31 23:00:00     15.0      3.0       13       49

131265 rows × 4 columns

From there, a few lines of code are enough to answer questions about these data:

Does the air pollution show a decreasing trend over the years?



In [4]:

    
data['1999':].resample('A').plot(ylim=[0,100])









    Out[4]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f458c4c4f28>
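The cell above relies on `resample('A')` returning yearly means directly, which matches the pandas versions this tutorial targets. In newer pandas releases, to my knowledge, `resample` returns an intermediate object and the aggregation has to be requested explicitly. The sketch below shows that pattern on a synthetic daily series (the column name `station_a` and the values are made up, not AirBase data). It uses the `'YS'` (year start) alias as an assumption, because the `'A'` alias appears to be deprecated in recent releases.

```python
import numpy as np
import pandas as pd

# Synthetic daily series standing in for one station (not AirBase data):
# every day of year 20xx gets the constant value xx.
index = pd.date_range('2010-01-01', '2012-12-31', freq='D')
toy = pd.DataFrame({'station_a': np.asarray(index.year, dtype=float) - 2000},
                   index=index)

# Select everything from 2011 onwards, group by calendar year,
# and reduce each group to its mean explicitly.
annual = toy.loc['2011':].resample('YS').mean()
print(annual)
```

With matplotlib available, `annual.plot(ylim=[0, 100])` would then draw the same kind of figure as the cell above.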

How many exceedances of the limit values are there?



In [5]:

    
exceedances = data > 200
exceedances = exceedances.groupby(exceedances.index.year).sum()
ax = exceedances.loc[2005:].plot(kind='bar')
ax.axhline(18, color='k', linestyle='--')









    Out[5]:





<matplotlib.lines.Line2D at 0x7f458c1d15c0>
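The counting step above works because a comparison yields a boolean table, and booleans sum as 0 and 1. A minimal sketch on made-up values (the station names and numbers are illustrative; only the threshold of 200 is taken from the cell above):

```python
import pandas as pd

# Five made-up hourly readings spread over two years.
index = pd.to_datetime(['2005-06-01 10:00', '2005-06-01 11:00',
                        '2006-07-15 14:00', '2006-07-15 15:00',
                        '2006-07-16 15:00'])
toy = pd.DataFrame({'station_a': [150.0, 210.0, 220.0, 250.0, 90.0],
                    'station_b': [205.0, 100.0, 50.0, 300.0, float('nan')]},
                   index=index)

# True where the reading is above the threshold; NaN compares as False.
above = toy > 200

# Group the boolean rows by calendar year and sum: True counts as 1.
counts = above.groupby(above.index.year).sum()
print(counts)
```

Missing measurements are silently treated as "no exceedance" here, which is worth keeping in mind for stations with long gaps.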

What is the difference in diurnal profile between weekdays and weekends?



In [6]:

    
data['weekday'] = data.index.weekday
data['weekend'] = data['weekday'].isin([5, 6])
data_weekend = data.groupby(['weekend', data.index.hour])['FR04012'].mean().unstack(level=0)
data_weekend.plot()









    Out[6]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f458c0ca0f0>

We will come back to these examples and build them up step by step.

Why do you need pandas?

When working with tabular or structured data (such as an R data frame, an SQL table, or an Excel spreadsheet), you need to:

Import data
Clean up messy data
Explore data, gain insight into data
Process and prepare your data for analysis
Analyse your data (together with scikit-learn, statsmodels, ...)

Pandas: data analysis in Python

For data-intensive work in Python, the pandas library has become essential.

What is pandas?

Pandas can be thought of as NumPy arrays with labels for rows and columns, and better support for heterogeneous data types, but it's also much, much more than that.
Pandas can also be thought of as R's data.frame in Python.
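As a small illustration of the "NumPy array with labels" view, the sketch below wraps an array in a DataFrame and adds a text column next to the numeric ones. The row and column names are hypothetical, chosen only for the example.

```python
import numpy as np
import pandas as pd

# A plain NumPy array: values are addressed by position only.
arr = np.array([[16.0, 47.0],
                [13.0, 43.0]])

# The same values with row and column labels attached.
df = pd.DataFrame(arr, index=['row_a', 'row_b'], columns=['x', 'y'])

# Columns may hold different types: here a text column joins two float columns.
df['label'] = ['low', 'high']

by_position = arr[1, 1]             # NumPy: second row, second column
by_label = df.loc['row_b', 'y']     # pandas: the same value, looked up by label
print(df.dtypes)
```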

Its documentation: http://pandas.pydata.org/pandas-docs/stable/

Key features

Fast, easy and flexible input/output for a lot of different data formats
Working with missing data (.dropna(), pd.isnull())
Merging and joining (concat, join)
Grouping: groupby functionality
Reshaping (stack, pivot)
Powerful time series manipulation (resampling, time zones, ...)
Easy plotting
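The features listed above can be sketched in a few lines each. The example below is only a rough tour on a tiny in-memory CSV; the column names, the values, and the `-9999` missing-value marker are assumptions chosen to resemble the air quality data, not taken from it.

```python
import io
import numpy as np
import pandas as pd

# Input/output: parse a small CSV held in memory; -9999 marks a missing value.
csv_text = """time,station_a,station_b
2015-01-01,10.0,-9999
2015-01-02,-9999,4.0
2015-01-03,14.0,6.0
2015-01-04,16.0,8.0
"""
df = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True,
                 na_values='-9999')

# Missing data: count gaps per column, then keep only complete rows.
n_missing = df.isnull().sum()
complete = df.dropna()

# Merging and joining: add a column from another table with the same index.
extra = pd.DataFrame({'station_c': [1.0, 2.0, 3.0, 4.0]}, index=df.index)
combined = pd.concat([df, extra], axis=1)

# Grouping: mean per day type (weekday numbers 5 and 6 are Saturday and Sunday).
day_type = np.where(combined.index.weekday >= 5, 'weekend', 'weekday')
by_day_type = combined.groupby(day_type).mean()

# Reshaping: move the column labels into an inner index level.
long = complete.stack()

# Time series: average over two-day bins.
two_daily = combined.resample('2D').mean()

print(by_day_type)
print(two_daily)
```

Plotting is left out to keep the sketch free of a matplotlib dependency; calling `combined.plot()` would draw one line per column.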

How can you help?

We need you!

Contributions are very welcome and can be in different domains:

reporting issues
improving the documentation
testing release candidates and providing feedback
triaging and fixing bugs
implementing new features
spreading the word

-> https://github.com/pydata/pandas

